{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook will show you how to organize in 2D a set of documents/articles/posts so that articles with similar content are grouped near to each other. The example I am using is a set of Wikipedia articles of [Political ideologies](https://en.wikipedia.org/wiki/List_of_political_ideologies), but in principle it can be used for any set of documents. \n",
    "\n",
    "The result of this notebook [can be viewed live here](https://www.genekogan.com/works/wiki-tSNE/).\n",
    "\n",
    "### Procedure\n",
    "\n",
    "The pipeline consists of two steps.\n",
    "\n",
    "1) Convert all of the articles to a [tf-idf matrix](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).\n",
    "\n",
    "tf-idf stands for \"term frequency inverse document frequency\" and is commonly used in natural language processing applications dealing with large collections of documents. A tf-idf matrix is an $n * m$ sparse matrix consisting of $n$ rows, corresponding to our $n$ documents, and $m$ columns, corresponding to the $m$ unique \"terms\" (usually just words but can be n-grams or other kinds of tokens) that appear in the entire corpus.\n",
    "\n",
    "Each entry in the matrix, $tfidf(i,j)$ can be interpreted as the \"relative importance\" of term $j$ to document $i$.  It is calculated as\n",
    "\n",
    "$$tfidf(i,j) = tf(i,j)*idf(i,j)$$\n",
    "\n",
    "$tf(i, j)$ is the \"term frequency,\" i.e. the percentage of terms in document $i$ which are term $j$. For example, in the document \"the cat in the hat\", the term \"the\" has a $tf$ of (2 / 5) = 0.4. Thus $tf$ is high when the term is frequently found in the document.\n",
    "\n",
    "$idf(i, j)$, not to be confused with [this IDF](https://twitter.com/idfspokesperson/status/547144026445471744) is the \"inverse document frequency.\" It is given by:\n",
    "\n",
    "$$idf(i, j) = log(\\frac{N}{n_j})$$\n",
    "\n",
    "where $N$ is the number of documents in the corpus and $n_j$ is the number of documents which contain term $j$. When $n_j$ is high, this value shrinks towards 0. This happens when the term frequently appears in many or all documents, thus common terms like \"the\", \"a\", \"it\", etc will have a low $idf$ score because they appear in most documents. Conversely, when the term rarely appears in documents ($n_j$ is low), then $idf$ score will be high. These tend to be special or topic-specific words which appear in few of the documents.\n",
    "\n",
    "So intuitively, the $tfidf$ score for a term in a document goes higher if the term appears frequently in the document and appears infrequently in other documents (so that term is important to that document).\n",
    "\n",
    "2) This gives you a high-dimensional matrix of n documents which can be reduced to 2 dimensions using the [t-SNE](https://lvdmaaten.github.io/tsne/) dimensionality reduction technique. A better description of how t-SNE works can be found in the link.\n",
    "\n",
    "### Installation\n",
    "\n",
    "You need [nltk](http://www.nltk.org/install.html) and [scikit-learn](http://scikit-learn.org/) to run most of this notebook. You need to also download the 'punkt' dataset from nltk.\n",
    "\n",
    "    pip install -U nltk\n",
    "    pip install -U scikit-learn\n",
    "    python -c \"import nltk; nltk.download('punkt')\"\n",
    "\n",
    "Also, if you don't already have [numpy](http://www.numpy.org/), you should install it as well (it's only used to normalize the data later). \n",
    "\n",
    "    pip install numpy\n",
    "\n",
    "Additionally, if you are following this example and wish to extract articles from Wikipedia, you need the python [wikipedia](https://pypi.python.org/pypi/wikipedia/) library. \n",
    "\n",
    "    pip install wikipedia\n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First make sure all these imports work (minus wikipedia if you intend to use another corpus). We will assume wikipedia from here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to\n",
      "[nltk_data]     /Users/itpstudent/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package stopwords to\n",
      "[nltk_data]     /Users/itpstudent/nltk_data...\n",
      "[nltk_data]   Unzipping corpora/stopwords.zip.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import string\n",
    "import os\n",
    "import time\n",
    "import pickle\n",
    "import json\n",
    "import re\n",
    "import wikipedia\n",
    "import nltk\n",
    "import numpy as np\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem.porter import PorterStemmer\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.manifold import TSNE\n",
    "from sklearn.preprocessing import normalize\n",
    "from sklearn.decomposition import TruncatedSVD\n",
    "\n",
    "nltk.download('punkt')\n",
    "nltk.download('stopwords')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we are going to cluster all of the links found in the Wikipedia page \"[Vital articles](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles).\" First, we will open the page in main, then add all of the raw text of the articles into a dictionary called token_dict."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "main = wikipedia.page('Wikipedia:Vital articles')\n",
    "#https://en.wikipedia.org/wiki/Wikipedia:1,000_core_topics\n",
    "#https://en.wikipedia.org/wiki/User:West.andrew.g/2016_Popular_pages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "getting text for article 0/997 : 0\n",
      "getting text for article 20/997 : Age of Enlightenment\n",
      "getting text for article 40/997 : Amazon rainforest\n",
      "getting text for article 60/997 : Anthropology\n",
      "getting text for article 80/997 : Association football\n",
      "getting text for article 100/997 : Benjamin Franklin\n",
      "getting text for article 120/997 : Brazil\n",
      "getting text for article 140/997 : Carbon\n",
      "getting text for article 160/997 : Chemical reaction\n",
      "getting text for article 180/997 : Coal\n",
      "getting text for article 200/997 : Confucius\n",
      "getting text for article 220/997 : DNA\n",
      "getting text for article 240/997 : Discrimination\n",
      "getting text for article 260/997 : Economics\n",
      "getting text for article 280/997 : Engineering\n",
      "getting text for article 300/997 : Existence\n",
      "getting text for article 320/997 : Flood\n",
      "getting text for article 340/997 : Function (mathematics)\n",
      "getting text for article 360/997 : Geometry\n",
      "getting text for article 380/997 : Great Wall of China\n",
      "getting text for article 400/997 : Hippocrates\n",
      "getting text for article 420/997 : History of the world\n",
      "getting text for article 440/997 : Ibn Khaldun\n",
      "getting text for article 460/997 : Inorganic chemistry\n",
      "getting text for article 480/997 : Jainism\n",
      "getting text for article 500/997 : Judaism\n",
      "getting text for article 520/997 : Leo Tolstoy\n",
      "getting text for article 540/997 : Ludwig van Beethoven\n",
      "getting text for article 560/997 : Marketing\n",
      "getting text for article 580/997 : Medicine\n",
      "getting text for article 600/997 : Middle Ages\n",
      "getting text for article 620/997 : Moon landing\n",
      "getting text for article 640/997 : Nationalism\n",
      "getting text for article 660/997 : Nikola Tesla\n",
      "getting text for article 680/997 : Organic chemistry\n",
      "getting text for article 700/997 : Personal name\n",
      "getting text for article 720/997 : Play (activity)\n",
      "getting text for article 740/997 : Prime number\n",
      "getting text for article 760/997 : Ramesses II\n",
      "getting text for article 780/997 : Robotics\n",
      "getting text for article 800/997 : Scientific method\n",
      "getting text for article 820/997 : Short story\n",
      "getting text for article 840/997 : Socialism\n",
      "getting text for article 860/997 : Spanish language\n",
      "getting text for article 880/997 : Strong interaction\n",
      "getting text for article 900/997 : Tank\n",
      "getting text for article 920/997 : Theravada\n",
      "getting text for article 940/997 : Tuberculosis\n",
      "getting text for article 960/997 : Virus\n",
      "getting text for article 980/997 : Wind power\n"
     ]
    }
   ],
   "source": [
    "token_dict = {}\n",
    "for i, article in enumerate(main.links):\n",
    "    if article not in token_dict:\n",
    "        if i%20==0:\n",
    "            print \"getting text for article %d/%d : %s\"%(i, len(main.links), article)\n",
    "        try:\n",
    "            text = wikipedia.page(article)\n",
    "            token_dict[article] = text.content\n",
    "        except:\n",
    "            print \" ==> error processing \"+article\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can save the text to disk in the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "pickle.dump(token_dict, open('fulltext_WikiVitalArticles.p', 'wb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "later you can retrieve it like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "token_dict = pickle.load(open('fulltext_WikiVitalArticles.p', 'rb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next sell will calculate the SVD decomposition of the tf-idf matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "calculating tf-idf\n",
      "reducing tf-idf to 500 dim\n",
      "done\n"
     ]
    }
   ],
   "source": [
    "def tokenize(text):\n",
    "    text = text.lower()    # lower case\n",
    "    text = re.sub(r\"[%s\\n\\t]+\"%string.punctuation, ' ', text)  # remove punctuation\n",
    "    text = re.sub(r\"[ ]+\",  \" \",  text)  # remove extra spaces\n",
    "    text = text.translate(string.punctuation)  # punctuation\n",
    "    tokens = nltk.word_tokenize(text)\n",
    "    tokens = [t for t in tokens if not t in stopwords.words('english')] # stopwords\n",
    "    stems = [PorterStemmer().stem(t) for t in tokens]\n",
    "    return stems\n",
    "\n",
    "# calculate tfidf (might take a while)\n",
    "print(\"calculating tf-idf\")\n",
    "tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')\n",
    "tfs = tfidf.fit_transform(token_dict.values())\n",
    "\n",
    "print(\"reducing tf-idf to 500 dim\")\n",
    "tfs_reduced = TruncatedSVD(n_components=500, random_state=0).fit_transform(tfs)\n",
    "print(\"done\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can save the calculation to disk in the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "pickle.dump(tfs_reduced, open('tfidf_WikiVitalArticles.p', 'wb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can load it back later like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "tfs_reduced = pickle.load(open('tfidf_WikiVitalArticles.p', 'rb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now calculate t-SNE on the reduced feature vectors and normalize to (0,1)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[t-SNE] Computing 151 nearest neighbors...\n",
      "[t-SNE] Indexed 997 samples in 0.005s...\n",
      "[t-SNE] Computed neighbors for 997 samples in 0.780s...\n",
      "[t-SNE] Computed conditional probabilities for sample 997 / 997\n",
      "[t-SNE] Mean sigma: 0.381256\n",
      "[t-SNE] Computed conditional probabilities in 0.052s\n",
      "[t-SNE] Iteration 50: error = 72.6430206, gradient norm = 0.3293283 (50 iterations in 1.764s)\n",
      "[t-SNE] Iteration 100: error = 74.4200134, gradient norm = 0.3027419 (50 iterations in 1.741s)\n",
      "[t-SNE] Iteration 150: error = 75.8908005, gradient norm = 0.2851454 (50 iterations in 1.800s)\n",
      "[t-SNE] Iteration 200: error = 75.6168747, gradient norm = 0.3014873 (50 iterations in 1.789s)\n",
      "[t-SNE] Iteration 250: error = 76.6445007, gradient norm = 0.2806742 (50 iterations in 1.782s)\n",
      "[t-SNE] KL divergence after 250 iterations with early exaggeration: 76.644501\n",
      "[t-SNE] Iteration 300: error = 1.5614842, gradient norm = 0.0028795 (50 iterations in 1.657s)\n",
      "[t-SNE] Iteration 350: error = 1.4017555, gradient norm = 0.0009432 (50 iterations in 1.724s)\n",
      "[t-SNE] Iteration 400: error = 1.3293533, gradient norm = 0.0004273 (50 iterations in 1.510s)\n",
      "[t-SNE] Iteration 450: error = 1.3067455, gradient norm = 0.0002495 (50 iterations in 1.451s)\n",
      "[t-SNE] Iteration 500: error = 1.2943467, gradient norm = 0.0002196 (50 iterations in 1.454s)\n",
      "[t-SNE] Iteration 550: error = 1.2851582, gradient norm = 0.0002550 (50 iterations in 1.453s)\n",
      "[t-SNE] Iteration 600: error = 1.2779895, gradient norm = 0.0003128 (50 iterations in 1.447s)\n",
      "[t-SNE] Iteration 650: error = 1.2735915, gradient norm = 0.0001538 (50 iterations in 1.452s)\n",
      "[t-SNE] Iteration 700: error = 1.2691839, gradient norm = 0.0001373 (50 iterations in 1.471s)\n",
      "[t-SNE] Iteration 750: error = 1.2658666, gradient norm = 0.0001648 (50 iterations in 1.447s)\n",
      "[t-SNE] Iteration 800: error = 1.2637744, gradient norm = 0.0001543 (50 iterations in 1.446s)\n",
      "[t-SNE] Iteration 850: error = 1.2607478, gradient norm = 0.0001401 (50 iterations in 1.472s)\n",
      "[t-SNE] Iteration 900: error = 1.2588487, gradient norm = 0.0000997 (50 iterations in 1.463s)\n",
      "[t-SNE] Iteration 950: error = 1.2562903, gradient norm = 0.0001329 (50 iterations in 1.483s)\n",
      "[t-SNE] Iteration 1000: error = 1.2551097, gradient norm = 0.0000897 (50 iterations in 1.480s)\n",
      "[t-SNE] Error after 1000 iterations: 1.255110\n"
     ]
    }
   ],
   "source": [
    "# calculate t-SNE\n",
    "tsne = TSNE(n_components=2, perplexity=50, verbose=2).fit_transform(tfs_reduced)\n",
    "\n",
    "# save to json file\n",
    "x_axis, y_axis = tsne[:, 0], tsne[:, 1]\n",
    "x_norm = (x_axis-np.min(x_axis)) / (np.max(x_axis) - np.min(x_axis))\n",
    "y_norm = (y_axis-np.min(y_axis)) / (np.max(y_axis) - np.min(y_axis))\n",
    "data = {\"x\":[float(x) for x in x_norm.tolist()], \"y\":[float(y) for y in y_norm.tolist()], \"names\":token_dict.keys()}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save to json for future-keeping."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('tsne_wikiVitalArticles.json', 'w') as outfile:\n",
    "    json.dump(data, outfile)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also convert the t-SNE to an `nx` by `ny` grid assignment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "nx, ny = 32, 31\n",
    "\n",
    "import rasterfairy\n",
    "grid_assignment = rasterfairy.transformPointCloud2D(tsne[0:nx*ny, :], target=(nx, ny))[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell will create an HTML file with the gridded wikipedia articles arranged by similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "grid_sorted = sorted(range(len(grid_assignment)), key=lambda k: grid_assignment[k][1]*nx + grid_assignment[k][0])\n",
    "keys = list(token_dict.keys())\n",
    "\n",
    "links_grid = [[0 for x in range(nx)] for y in range(ny)] \n",
    "for i, g in enumerate(grid_assignment):\n",
    "    links_grid[int(g[1])][int(g[0])] = keys[i]\n",
    "\n",
    "table_html = '<table>\\n'\n",
    "for row in links_grid:\n",
    "    table_html += '\\t<tr>\\n'\n",
    "    for col in row:\n",
    "        table_html += '\\t\\t<td><a href=\\\"https://en.wikipedia.org/wiki/%s\\\">%s</a></td>\\n' % (col, col)\n",
    "    table_html += '\\t</tr>\\n'\n",
    "table_html += '</table>\\n'\n",
    "\n",
    "html = '''\n",
    "    <head>\n",
    "    <style>\n",
    "        body {\n",
    "          padding-top: 80px;\n",
    "          text-align: center;\n",
    "          font-family: monaco, monospace;\n",
    "          background-size: cover;\n",
    "        }\n",
    "        table {\n",
    "            text-align: center;\n",
    "        }\n",
    "        tr {\n",
    "          background-color:#ff0;\n",
    "        }\n",
    "        td {\n",
    "          padding:10px; \n",
    "        }\n",
    "    </style>\n",
    "    </head>\n",
    "    <body>\n",
    "        %s\n",
    "    </body>\n",
    "''' % table_html\n",
    "\n",
    "with open('index.html', 'wb') as text_file:\n",
    "    #text_file.write(html.encode('utf-8'))\n",
    "    text_file.write(html.encode('utf-8'))\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
